It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu and Johnson 2003).
At its heart, data wrangling is the set of techniques to get data into a usable form for analysis and visualization.
Data wrangling is a catchall for data importing, data cleaning, and data transformation to create some product with the finished data.
All messy data is unhappy in its own way. To put some structure on data, we prefer that our data is tidy. In tidy data, every column is a variable, every row is an observation, and every cell is a single value (Wickham 2014).
Tidy datasets provide a standardized way to link the physical appearance of the dataset with its meaning (Wickham 2014). Datasets are made of rows and columns that contain a collection of values
Here is a dataset inspired by an Economist/YouGov Poll about the importance of unemployment in the United States.
challenge_1 <- tibble(
issue_importance = c("Very Important", "Somewhat Important",
"Not very Important", "Unimportant"),
`18-29` = c(.59, .28, .08, .05),
`30-44` = c(.66, .27, .06, .02),
`45-64` = c(.69, .28, .03, .01),
`65+` = c(.7, .27, .02, .01)
)
head(challenge_1)
## # A tibble: 4 × 5
## issue_importance `18-29` `30-44` `45-64` `65+`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Very Important 0.59 0.66 0.69 0.7
## 2 Somewhat Important 0.28 0.27 0.28 0.27
## 3 Not very Important 0.08 0.06 0.03 0.02
## 4 Unimportant 0.05 0.02 0.01 0.01
Answer the following:
How many observations are in the data set?
How many columns are in the data set?
Is every column a variable?
Consider the data below. Both data sets display information on heart rate observed in individuals across three different time periods. But the data are organized differently in each table. Which one of these do you think is the tidy format?
wide <- data.frame(
name = c("Wilbur", "Petunia", "Gregory"),
time1 = c(67, 80, 64),
time2 = c(56, 90, 50),
time3 = c(70, 67, 101)
)
wide
## name time1 time2 time3
## 1 Wilbur 67 56 70
## 2 Petunia 80 90 67
## 3 Gregory 64 50 101
long <- data.frame(
name = c("Wilbur", "Petunia", "Gregory", "Wilbur", "Petunia", "Gregory", "Wilbur", "Petunia", "Gregory"),
time = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
heartrate = c(67, 80, 64, 56, 90, 50, 70, 67, 10)
)
long
## name time heartrate
## 1 Wilbur 1 67
## 2 Petunia 1 80
## 3 Gregory 1 64
## 4 Wilbur 2 56
## 5 Petunia 2 90
## 6 Gregory 2 50
## 7 Wilbur 3 70
## 8 Petunia 3 67
## 9 Gregory 3 10
We often refer to these two opposite structures as “long” vs. “wide” formats. In the “long” format, you usually have 1 column for the observed variable(s), and the other columns are ID variables.
For the “wide” format, each row is often a site/subject/patient, and you have multiple observation variables containing the same type of data.
Let’s make the data tidy.
## # A tibble: 6 × 3
## issue_importance age percent_agreement
## <chr> <chr> <dbl>
## 1 Very Important 18-29 0.59
## 2 Very Important 30-44 0.66
## 3 Very Important 45-64 0.69
## 4 Very Important 65+ 0.7
## 5 Somewhat Important 18-29 0.28
## 6 Somewhat Important 30-44 0.27
The format here is now tidy. Each column represents a single variable. The separate age columns have been combined into a single column. Each row represents a single cross-sectional observation. Finally, each cell represents a single value.
The tidyverse is a core set of packages that most R programmers use daily for everyday data science tasks. We can load this set of packages all at once with:
library(tidyverse)
Alternatively, we can load individual packages from the tidyverse. For example, we can load the dplyr package.
library(dplyr)
The dplyr package provides many very useful functions for manipulating data frames. These functions will save you time by reducing repetition. As a bonus, you might even find the dplyr grammar easier to read.
In this workshop, we will cover 6 of the most commonly used functions and use pipes (%>%) to combine them.
filter(): pick out observations based on their valuesselect(): pick certain columns by namegroup_by(): group observations based on a variablearrange(): reorder observations based on a variablesummarize(): summarize observations based on some functionmutate(): create new variables with function of existing variablesEach of these verbs works similarly. The first argument is always a data frame object. The additional arguments describe what to do with the data frame using unquoted variable names. The result is a new data frame.
Before diving into each function, let’s give an example of a data wrangling workflow that starts from reading in a dataset and then makes a dataset suitable for a descriptive analysis. This example contains an example of every verb we will talk about in this workshop.
Suppose we have a dataset of individuals’ occupations from different US states. In the data folder for this workshop repository, a dataset called “jobs” matches this description. We can read it in with read_csv(), which is available from the tidyverse. read_csv works exactly like read.csv but is much quicker.
The here function is a way to create file paths relative to the top-level directory.
For a descriptive analysis, we’d like to know the average life expectancy for individuals who are and are not in agriculture for states within each state on the West Coast. To accomplish this task, we will need to filter our list of states, group by each state, create a variable denoting whether the individual works in agriculture or not, and summarize the average age of individuals who do and do not work in agriculture. Finally, we’d like to arrange the states in reverse alphabetical order.
What follows is an example of the workflow just described in the spirit of teaching the whole game
## Data import
jobs <- read_csv(here("data/jobs.csv"))
## New names:
## * `` -> ...1
## Rows: 1000 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): State, Education, Employment, Marital, Political, Name, Occupation
## dbl (3): ...1, Age, Income
## lgl (1): Smokes
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Workflow
jobs %>%
select(State, Occupation, Age)%>%
filter(State %in% c("Alaska","California", "Oregon", "Washington"))%>%
mutate(in_ag = if_else(Occupation == "Farmer", "Yes", "No"))%>%
group_by(State, in_ag)%>%
summarise(average_lifeExp = mean(Age))%>%
arrange(desc(State))
## `summarise()` has grouped output by 'State'. You can override using the `.groups` argument.
## # A tibble: 8 × 3
## # Groups: State [4]
## State in_ag average_lifeExp
## <chr> <chr> <dbl>
## 1 Washington No 55
## 2 Washington Yes 64.5
## 3 Oregon No 49.3
## 4 Oregon Yes 58
## 5 California No 53.8
## 6 California Yes 53.6
## 7 Alaska No 64
## 8 Alaska Yes 77
## We can also save the output of our workflow to a new data frame object
wc_lifeExp <- jobs %>%
select(State, Occupation, Age)%>%
filter(State %in% c("Alaska","California", "Oregon", "Washington"))%>%
mutate(in_ag = if_else(Occupation == "Farmer", "Yes", "No"))%>%
group_by(State, in_ag)%>%
summarise(average_lifeExp = mean(Age))%>%
arrange(desc(State))
## `summarise()` has grouped output by 'State'. You can override using the `.groups` argument.
The way to think of the pipe, the symbol at the end of each of these lines that is %>%, is that it takes what is on the left side and makes it the first argument of the right side.
The tidyverse is built around pipes because the first argument of every function in the tidyverse is a data frame, which means that the tidyverse all follows the same structure (or API). When doing data tasks, having the same structure turns out to be very useful because it is easy to reason about what a function will do and what it needs.
A pipe operator %>% or |> is a method of chaining together functions. A pipe takes what is on the left-hand side of the pipe and makes it the first argument of the function on the right-hand side by default.
## Pseudo code
x %>% f(y)
## is equivalent to
f(x, y)
## where f() is a function
## Example with the jobs data set
jobs %>%
filter(State == "California")
## is equivalent to
select(jobs, State == "California")
As mentioned, the advantage of the tidyverse is that every function presumes that the first argument is a tidy data object. It is possible to put the left-hand side result as a different argument for the function on the right-hand side by explicitly specifying it. However, a well-written workflow tends to obviate the need to do so.
To show the benefits of the dplyr verbs, we will start by working with a dataset of animal rescue incidents by the London Fire Brigade from 2009-2021. The data provides information on all incidents and information on the location, date and time, number of fire trucks, and cost of the rescue.
animals <- read_csv(here("data/animalRescue.csv"))
## Rows: 8044 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): IncidentNumber, Ward, Borough, PumpCount, PumpHoursTotal, Type, Pro...
## dbl (2): Year, RescueCost
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(animals)
## Rows: 8,044
## Columns: 10
## $ IncidentNumber <chr> "139091", "275091", "2075091", "2872091", "3553091", …
## $ Year <dbl> 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009,…
## $ Ward <chr> "Crystal Palace & Upper Norwood", "Woodside", "Carsha…
## $ Borough <chr> "Croydon", "Croydon", "Sutton", "Hillingdon", "Haveri…
## $ PumpCount <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"…
## $ PumpHoursTotal <chr> "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"…
## $ Type <chr> "Dog", "Fox", "Dog", "Horse", "Rabbit", "Unknown", "D…
## $ PropertyCategory <chr> "Dwelling", "Outdoor Structure", "Outdoor Structure",…
## $ RescueType <chr> "Other animal assistance", "Other animal assistance",…
## $ RescueCost <dbl> 510, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255…
Suppose we are interested in looking at animal rescues from 2018-2020. filter() allows us to subset observations based on their value.
Since we are introducing a verb explicitly for the first time, we will show the verb with and without pipes. The API is the same for every verb.
## Get all observations from the Year 2015
filter(animals, Year == 2015)
## # A tibble: 540 × 10
## IncidentNumber Year Ward Borough PumpCount PumpHoursTotal Type
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 304151 2015 Hampton Richmond Upo… 1 1 Dog
## 2 468151 2015 Chaucer Southwark 1 1 Unkn…
## 3 1074151 2015 Town Enfield 1 1 Cat
## 4 1528151 2015 Kenley Croydon 1 2 Dog
## 5 2217151 2015 Willesden … Brent 1 1 Bird
## 6 2270151 2015 Lea Bridge Hackney 1 1 Cat
## 7 2291151 2015 Falconwood… Bexley 1 1 Cat
## 8 2815151 2015 Teddington Richmond Upo… 1 1 Cat
## 9 3211151 2015 New Adding… Croydon 1 1 Cat
## 10 3409151 2015 Brunel Hillingdon 1 1 Dog
## # … with 530 more rows, and 3 more variables: PropertyCategory <chr>,
## # RescueType <chr>, RescueCost <dbl>
## Get all observations from 2018-2020
filter(animals, Year >= 2018 & Year <=2020)
## # A tibble: 1,968 × 10
## IncidentNumber Year Ward Borough PumpCount PumpHoursTotal Type
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 001058-03012018 2018 Carshalton … Sutton 1 2 Dog
## 2 001474-04012018 2018 Stanley Kensington… 1 1 Bird
## 3 001495-04012018 2018 Dulwich Vil… Southwark 1 1 Bird
## 4 002204-06012018 2018 Lea Bridge Waltham Fo… 1 1 Dog
## 5 002369-06012018 2018 St. Andrew's Havering 1 1 Bird
## 6 002832-07012018 2018 Garden Subu… Barnet 1 1 Bird
## 7 003760-09012018 2018 Northcote Wandsworth 1 1 Bird
## 8 004039-10012018 2018 Haverstock Camden 1 1 Cat
## 9 004051-10012018 2018 Ponders End Enfield 1 3 Horse
## 10 004435-11012018 2018 Knight's Hi… Lambeth 1 1 Cat
## # … with 1,958 more rows, and 3 more variables: PropertyCategory <chr>,
## # RescueType <chr>, RescueCost <dbl>
The first argument is the name of the data frame we want to use, in this case, “animals.” The arguments after are the expressions that filter the data frame.
Here is the same code with a pipe and a neat shortcut
animals %>%
filter(Year == 2015)
## # A tibble: 540 × 10
## IncidentNumber Year Ward Borough PumpCount PumpHoursTotal Type
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 304151 2015 Hampton Richmond Upo… 1 1 Dog
## 2 468151 2015 Chaucer Southwark 1 1 Unkn…
## 3 1074151 2015 Town Enfield 1 1 Cat
## 4 1528151 2015 Kenley Croydon 1 2 Dog
## 5 2217151 2015 Willesden … Brent 1 1 Bird
## 6 2270151 2015 Lea Bridge Hackney 1 1 Cat
## 7 2291151 2015 Falconwood… Bexley 1 1 Cat
## 8 2815151 2015 Teddington Richmond Upo… 1 1 Cat
## 9 3211151 2015 New Adding… Croydon 1 1 Cat
## 10 3409151 2015 Brunel Hillingdon 1 1 Dog
## # … with 530 more rows, and 3 more variables: PropertyCategory <chr>,
## # RescueType <chr>, RescueCost <dbl>
animals %>%
filter(Year >= 2018 & Year <= 2020)
## # A tibble: 1,968 × 10
## IncidentNumber Year Ward Borough PumpCount PumpHoursTotal Type
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 001058-03012018 2018 Carshalton … Sutton 1 2 Dog
## 2 001474-04012018 2018 Stanley Kensington… 1 1 Bird
## 3 001495-04012018 2018 Dulwich Vil… Southwark 1 1 Bird
## 4 002204-06012018 2018 Lea Bridge Waltham Fo… 1 1 Dog
## 5 002369-06012018 2018 St. Andrew's Havering 1 1 Bird
## 6 002832-07012018 2018 Garden Subu… Barnet 1 1 Bird
## 7 003760-09012018 2018 Northcote Wandsworth 1 1 Bird
## 8 004039-10012018 2018 Haverstock Camden 1 1 Cat
## 9 004051-10012018 2018 Ponders End Enfield 1 3 Horse
## 10 004435-11012018 2018 Knight's Hi… Lambeth 1 1 Cat
## # … with 1,958 more rows, and 3 more variables: PropertyCategory <chr>,
## # RescueType <chr>, RescueCost <dbl>
## which is the same as
animals %>%
filter(between(Year, 2018, 2020))
## # A tibble: 1,968 × 10
## IncidentNumber Year Ward Borough PumpCount PumpHoursTotal Type
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 001058-03012018 2018 Carshalton … Sutton 1 2 Dog
## 2 001474-04012018 2018 Stanley Kensington… 1 1 Bird
## 3 001495-04012018 2018 Dulwich Vil… Southwark 1 1 Bird
## 4 002204-06012018 2018 Lea Bridge Waltham Fo… 1 1 Dog
## 5 002369-06012018 2018 St. Andrew's Havering 1 1 Bird
## 6 002832-07012018 2018 Garden Subu… Barnet 1 1 Bird
## 7 003760-09012018 2018 Northcote Wandsworth 1 1 Bird
## 8 004039-10012018 2018 Haverstock Camden 1 1 Cat
## 9 004051-10012018 2018 Ponders End Enfield 1 3 Horse
## 10 004435-11012018 2018 Knight's Hi… Lambeth 1 1 Cat
## # … with 1,958 more rows, and 3 more variables: PropertyCategory <chr>,
## # RescueType <chr>, RescueCost <dbl>
As with the examples we have seen before, we take the animals data frame and then pass it via the pipe to the filter argument. Because the pipe puts the data frame as the first argument for every tidyverse function, we only have to specify the subsequent arguments. between() is a dplyr function that is a shortcut for x >= left & x <= right where x is our variable of interest.
You can extend filter() with logical conditions:
| Symbol | Meaning |
|---|---|
| < | less than |
| > | greater than |
| == | equal to |
| <= | less than or equal to |
| >= | greater than or equal to |
| != | not equal to |
| %in% | group membership |
| is.na | is NA |
| !is.na | is not NA |
| &, |, !, xor, any, all | Boolean operators |
Create a new data frame object called cats_and_dogs that filters the animals data frame for all observations for which the animal type is a cat or a dog. Use the pipe operator to chain the functions together.
Sometimes we want to filter multiple conditions at once. For example, we might be interested in all rescues of birds or horses after 2011. We use the , in our filter expression to accomplish this.
birds_and_horses <- animals %>%
filter(Type == "Bird"|Type == "Horse", Year > 2011)
glimpse(birds_and_horses)
## Rows: 1,487
## Columns: 10
## $ IncidentNumber <chr> "48122", "1454122", "2677122", "5808121", "6451121", …
## $ Year <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,…
## $ Ward <chr> "Wandle Valley", "Noel Park", "Winchmore Hill", "Clis…
## $ Borough <chr> "Sutton", "Haringey", "Enfield", "Hackney", "Bexley",…
## $ PumpCount <chr> "1", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1"…
## $ PumpHoursTotal <chr> "2", "1", "3", "1", "1", "1", "1", "1", "1", "1", "3"…
## $ Type <chr> "Horse", "Bird", "Bird", "Bird", "Horse", "Bird", "Bi…
## $ PropertyCategory <chr> "Outdoor", "Non Residential", "Outdoor", "Dwelling", …
## $ RescueType <chr> "Other animal assistance", "Animal rescue from height…
## $ RescueCost <dbl> 520, 260, 780, 260, 260, 260, 260, 260, 260, 260, 780…
We can put multiple filter functions in a row in a piped workflow, though it does not look as readable.
birds_and_horses2 <- animals %>%
filter(Type == "Bird"|Type == "Horse")%>%
filter(Year > 2011)
## These two data frames are the same
all.equal(birds_and_horses, birds_and_horses2)
## [1] TRUE
Now we have a data frame of just rescues of cats and dogs. Suppose we are only interested in a few variables of the data frame. We can use the select() function to keep only the variables we select.
To be concrete, suppose we just want to see the Year, Type, and PropertyCategory variables.
cats_and_dogs <- animals %>%
filter(Type == "Cat"|Type == "Dog")
cats_and_dogs_short <- cats_and_dogs %>%
select(Year, Type, PropertyCategory)
If we open up cats_and_dogs_short, we’ll see that it only contains the Year, Type, and PropertyCategory columns.
Both the select() and filter() functions subset the data frame. The difference is that select() extracts certain columns, while filter extracts certain rows.
When using select and filter together, the order of operations is very important. If we used select() first, the filter() function can only filter on variables that we selected.
We can also select observation variables using: - variable indices - variable names (without quotes) - x:z to select all variables between x and z - -y to exclude y - starts_with(x, ignore.case = TRUE): all names that starts with x - ends_with(x, ignore.case = TRUE): all names that ends with x - contains(x, ignore.case = TRUE): all names that contain x
A common task you’ll encounter when working with data is running calculations on different groups within the data. For instance, what if we wanted to calculate the total number of cat rescues for each borough?
The abstract problem we’re encountering here is known as “split-apply-combine”:
We want to split our data into groups (in this case boroughs), apply some calculations on that group (in this case count), and then combine the results together afterward. This pattern is common in data wrangling and a cornerstone of the dplyr approach.
A grouped_df can be thought of as a list where each item in the list is a data.frame containing only the rows that correspond to the particular value continent (at least in the example above).
Let’s demonstrate the benefits of group_by by answering the question, “What are total number of cat rescues for each borough?”
borough_cat_rescues <- animals %>%
## group all observations by Boroughs
group_by(Borough)%>%
## get just the observations of cat rescues
filter(Type == "Cat")%>%
## new function! count the unique values of one or more variables
count(Type)
head(borough_cat_rescues)
## # A tibble: 6 × 3
## # Groups: Borough [6]
## Borough Type n
## <chr> <chr> <int>
## 1 Barking And Dagenham Cat 105
## 2 Barnet Cat 127
## 3 Bexley Cat 84
## 4 Brent Cat 114
## 5 Bromley Cat 112
## 6 Camden Cat 137
We pass ungroup() to our data to ungroup our dataset.
animals %>%
group_by(Borough)%>%
filter(Type == "Cat")%>%
count(Type)%>%
ungroup()
## # A tibble: 32 × 3
## Borough Type n
## <chr> <chr> <int>
## 1 Barking And Dagenham Cat 105
## 2 Barnet Cat 127
## 3 Bexley Cat 84
## 4 Brent Cat 114
## 5 Bromley Cat 112
## 6 Camden Cat 137
## 7 Croydon Cat 161
## 8 Ealing Cat 131
## 9 Enfield Cat 146
## 10 Greenwich Cat 132
## # … with 22 more rows
We now have cat rescues by boroughs, but our quick glimpse has them in a strange order. To get the observations from smallest to largest, we can use arrange() and pass the variable name of interest.
borough_cat_rescues %>%
arrange(n)
## # A tibble: 32 × 3
## # Groups: Borough [32]
## Borough Type n
## <chr> <chr> <int>
## 1 Kingston Upon Thames Cat 58
## 2 Harrow Cat 66
## 3 Merton Cat 69
## 4 Havering Cat 72
## 5 Sutton Cat 79
## 6 Bexley Cat 84
## 7 Richmond Upon Thames Cat 86
## 8 Hillingdon Cat 87
## 9 Hounslow Cat 93
## 10 Hammersmith And Fulham Cat 103
## # … with 22 more rows
If we want to reverse the order, we need to wrap the variable name in desc() like so
borough_cat_rescues %>%
arrange(desc(n))
## # A tibble: 32 × 3
## # Groups: Borough [32]
## Borough Type n
## <chr> <chr> <int>
## 1 Newham Cat 190
## 2 Haringey Cat 185
## 3 Lambeth Cat 175
## 4 Southwark Cat 174
## 5 Hackney Cat 162
## 6 Croydon Cat 161
## 7 Tower Hamlets Cat 155
## 8 Lewisham Cat 153
## 9 Enfield Cat 146
## 10 Wandsworth Cat 139
## # … with 22 more rows
Let’s put the verbs we’ve learned so far into action all at once. Answer the question “What are the total number of dog rescues for each borough?” by making a data frame called borough_dog_rescues and arranging the rows by borough in alphabetical order.
## Your Code here
summarise() or summarize() collapses a data frame to a single row.
animals %>%
summarise(avg_rescue_cost = mean(RescueCost, na.rm = T))
## # A tibble: 1 × 1
## avg_rescue_cost
## <dbl>
## 1 358.
## Note that summarise can also be written with a z
animals %>%
summarize(avg_rescue_cost = mean(RescueCost, na.rm = T))
## # A tibble: 1 × 1
## avg_rescue_cost
## <dbl>
## 1 358.
summarise() becomes extremely useful when paired with group_by group_by(). We just saw that our last block gave the average cost across our data set. By using the group_by() function, we split our original data frame into multiple pieces, which we then use to run functions (e.g. mean()) within summarize().
Suppose instead we are interested in the average cost of a rescue for each borough.
animals %>%
group_by(Borough)%>%
summarise(avg_rescue_cost = mean(RescueCost, na.rm=T))%>%
arrange(desc(avg_rescue_cost))
## # A tibble: 37 × 2
## Borough avg_rescue_cost
## <chr> <dbl>
## 1 Epping Forest 650.
## 2 Bexley 431.
## 3 Bromley 404.
## 4 Enfield 399.
## 5 Havering 389.
## 6 Hillingdon 386.
## 7 Tower Hamlets 381.
## 8 Waltham Forest 379.
## 9 Hounslow 377.
## 10 Camden 374.
## # … with 27 more rows
We are not limited to grouping by just one variable. We can group by multiple variables. Let’s answer the question, “What’s the average cost by borough year?”
animals %>%
group_by(Borough, Year)%>%
summarise(avg_rescue_cost = mean(RescueCost, na.rm=T))%>%
arrange(Borough)
## `summarise()` has grouped output by 'Borough'. You can override using the `.groups` argument.
## # A tibble: 431 × 3
## # Groups: Borough [37]
## Borough Year avg_rescue_cost
## <chr> <dbl> <dbl>
## 1 Barking And Dagenham 2009 279.
## 2 Barking And Dagenham 2010 312
## 3 Barking And Dagenham 2011 329.
## 4 Barking And Dagenham 2012 260
## 5 Barking And Dagenham 2013 303.
## 6 Barking And Dagenham 2014 310.
## 7 Barking And Dagenham 2015 340.
## 8 Barking And Dagenham 2016 359.
## 9 Barking And Dagenham 2017 426.
## 10 Barking And Dagenham 2018 333.
## # … with 421 more rows
That is already quite powerful, but it gets even better! We are not limited to defining one new variable in summarize(). Suppose we want to know the average deviation within each borough year.
animals %>%
group_by(Borough, Year)%>%
summarise(avg_rescue_cost = mean(RescueCost, na.rm=T),
sd_rescue_cost = sd(RescueCost, na.rm = T))%>%
arrange(Borough)
## `summarise()` has grouped output by 'Borough'. You can override using the `.groups` argument.
## # A tibble: 431 × 4
## # Groups: Borough [37]
## Borough Year avg_rescue_cost sd_rescue_cost
## <chr> <dbl> <dbl> <dbl>
## 1 Barking And Dagenham 2009 279. 104.
## 2 Barking And Dagenham 2010 312 201.
## 3 Barking And Dagenham 2011 329. 208.
## 4 Barking And Dagenham 2012 260 0
## 5 Barking And Dagenham 2013 303. 77.5
## 6 Barking And Dagenham 2014 310. 67.9
## 7 Barking And Dagenham 2015 340. 107.
## 8 Barking And Dagenham 2016 359. 103.
## 9 Barking And Dagenham 2017 426. 220.
## 10 Barking And Dagenham 2018 333. 1.39
## # … with 421 more rows
You will receive a message that says “summarise() has grouped output by ‘Borough’. You can override using the .groups argument.” This is an experimental feature in dplyr and does not change the results of our computation at all. The computation that is identical but leads to R not sending that message is the following:
animals %>%
group_by(Borough, Year)%>%
summarise(avg_rescue_cost = mean(RescueCost, na.rm=T),
sd_rescue_cost = sd(RescueCost, na.rm = T),
.groups = "drop_last")%>%
arrange(Borough)
## # A tibble: 431 × 4
## # Groups: Borough [37]
## Borough Year avg_rescue_cost sd_rescue_cost
## <chr> <dbl> <dbl> <dbl>
## 1 Barking And Dagenham 2009 279. 104.
## 2 Barking And Dagenham 2010 312 201.
## 3 Barking And Dagenham 2011 329. 208.
## 4 Barking And Dagenham 2012 260 0
## 5 Barking And Dagenham 2013 303. 77.5
## 6 Barking And Dagenham 2014 310. 67.9
## 7 Barking And Dagenham 2015 340. 107.
## 8 Barking And Dagenham 2016 359. 103.
## 9 Barking And Dagenham 2017 426. 220.
## 10 Barking And Dagenham 2018 333. 1.39
## # … with 421 more rows
Create a data frame that considers the average rescue cost, median rescue cost, and standard deviation for bird rescues for each property category year. Save the result of your workflow into a data frame object called bird_stats
mutate() creates new variables to the same data frame that you pass into it.
To demonstrate mutate(), load the gapminder-FiveYearData.csv into RStudio as a data frame object called gapminder
gapminder <- read_csv(here("data/gapminder-FiveYearData.csv"))
## Rows: 1704 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, continent
## dbl (4): year, pop, lifeExp, gdpPercap
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ year <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ pop <dbl> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asi…
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
The gapminder dataset has six variables and is in tidy form. One of those variables is lifeExp. Imagine we are interested in comparing the difference between a country’s life expectancy and the average continent-year-wide life expectancy by year. Such a query requires us to make two new variables. The first variable is the mean continent-year life expectancy. The second variable is the difference between life expectancy for a country and the continent’s life expectancy for every year.
mutate() easily handles both cases. We can also use mutate() to create new variables before (or even after) summarizing information.
gapminder_xtra_vars <- gapminder %>%
group_by(year, continent)%>%
mutate(continent_lifeExp = mean(lifeExp, na.rm = T),
diff_lifeExp = lifeExp - continent_lifeExp)
Note that mutate() does not require a group_by(). We can simply add a new column, either as a function of existing columns or through a new computation. Here’s an example of doing both.
gap_mutate_no_group <- gapminder %>%
mutate(gdp_billion = (gdpPercap*pop)/10^9,
new_comp = 2+2)
A final useful function to know is the how-to count observations within a mutate. dplyr provides the handy n() function for this computation. Here’s an example.
gap_count <- gapminder %>%
filter(year == 2007)%>%
group_by(continent)%>%
mutate(numCountries = n())%>%
# distinct works like select() except it returns only rows
# that are unique
distinct(continent, numCountries)%>%
arrange(desc(numCountries))
Add a column to the gapminder dataset that contains the continent’s total population of each observation in a given year. For example, if the first observation was Afghanistan in 1952, the new column would contain the population of Asia in 1952.
Use dplyr to: (a) add a column called gdpPercap_diff that contains the difference between the observation’s gdpPercap and the mean gdpPercap of the continent in that year, (b) arrange the data frame by the column you just created, in descending order (so that the relatively richest country/years are listed first)
Until now, we have been using the nicely formatted, original gapminder dataset. Now, let’s start with the wide-format version of the gapminder dataset.
gap_wide <- read_csv(here("data/gapminder_wide.csv"))
## Rows: 142 Columns: 38
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): continent, country
## dbl (36): gdpPercap_1952, gdpPercap_1957, gdpPercap_1962, gdpPercap_1967, gd...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(gap_wide)
## # A tibble: 6 × 38
## continent country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Africa Algeria 2449. 3014. 2551. 3247.
## 2 Africa Angola 3521. 3828. 4269. 5523.
## 3 Africa Benin 1063. 960. 949. 1036.
## 4 Africa Botswana 851. 918. 984. 1215.
## 5 Africa Burkina… 543. 617. 723. 795.
## 6 Africa Burundi 339. 380. 355. 413.
## # … with 32 more variables: gdpPercap_1972 <dbl>, gdpPercap_1977 <dbl>,
## # gdpPercap_1982 <dbl>, gdpPercap_1987 <dbl>, gdpPercap_1992 <dbl>,
## # gdpPercap_1997 <dbl>, gdpPercap_2002 <dbl>, gdpPercap_2007 <dbl>,
## # lifeExp_1952 <dbl>, lifeExp_1957 <dbl>, lifeExp_1962 <dbl>,
## # lifeExp_1967 <dbl>, lifeExp_1972 <dbl>, lifeExp_1977 <dbl>,
## # lifeExp_1982 <dbl>, lifeExp_1987 <dbl>, lifeExp_1992 <dbl>,
## # lifeExp_1997 <dbl>, lifeExp_2002 <dbl>, lifeExp_2007 <dbl>, …
The first step towards getting our nice, tidydata format is to convert from the wide to the long format.
The function pivot_longer() will “gather” the observation variables into a single variable.
gap_long <- gap_wide %>%
pivot_longer(
cols = 3:38,
names_to = "obstype_year",
values_to = "obs_values"
)
head(gap_long)
## # A tibble: 6 × 4
## continent country obstype_year obs_values
## <chr> <chr> <chr> <dbl>
## 1 Africa Algeria gdpPercap_1952 2449.
## 2 Africa Algeria gdpPercap_1957 3014.
## 3 Africa Algeria gdpPercap_1962 2551.
## 4 Africa Algeria gdpPercap_1967 3247.
## 5 Africa Algeria gdpPercap_1972 4183.
## 6 Africa Algeria gdpPercap_1977 4910.
Notice that we put three arguments into the pivot_longer function:
3:38, signaling columns 3 through 38) into one variable. Notice that we want to keep columns 1 and 2, as these are considered “ID” variables.An alternative way to do indices is to type in the column names of interest explicitly. Here is an example where we exclude the first two columns.
makeLong <- gap_wide %>%
pivot_longer(
cols = c(-continent, -country),
names_to = "obstype_year",
values_to = "obs_values"
)
head(makeLong)
## # A tibble: 6 × 4
## continent country obstype_year obs_values
## <chr> <chr> <chr> <dbl>
## 1 Africa Algeria gdpPercap_1952 2449.
## 2 Africa Algeria gdpPercap_1957 3014.
## 3 Africa Algeria gdpPercap_1962 2551.
## 4 Africa Algeria gdpPercap_1967 3247.
## 5 Africa Algeria gdpPercap_1972 4183.
## 6 Africa Algeria gdpPercap_1977 4910.
the name of the new column for the new ID variable (obstype_year),
the name for the new amalgamated observation variable (obs_value)
Alternatively, we can use the column structure in the gap_wide data frame to do the same thing with the starts_with function.
# with the starts_with() function
gap_long <- gap_wide %>%
pivot_longer(
cols = c(starts_with('gdpPercap'),
starts_with('lifeExp'),
starts_with('pop')),
names_to = "obstype_year",
values_to = "obs_values"
)
head(gap_long)
## # A tibble: 6 × 4
## continent country obstype_year obs_values
## <chr> <chr> <chr> <dbl>
## 1 Africa Algeria gdpPercap_1952 2449.
## 2 Africa Algeria gdpPercap_1957 3014.
## 3 Africa Algeria gdpPercap_1962 2551.
## 4 Africa Algeria gdpPercap_1967 3247.
## 5 Africa Algeria gdpPercap_1972 4183.
## 6 Africa Algeria gdpPercap_1977 4910.
However you choose to do it, notice that the output collapses all of the measure variables into two columns: one containing the new ID variable, the other containing the observation value for that row.
You’ll notice that in our long dataset, obstype_year actually contains two pieces of information, the observation type (pop, lifeExp, or gdpPercap) and the year.
We can use the separate() function to split the character strings into multiple variables:
gap_long_sep <- gap_long %>%
separate(obstype_year, into = c('obs_type','year'), sep = "_") %>%
mutate(year = as.integer(year))
head(gap_long_sep)
## # A tibble: 6 × 5
## continent country obs_type year obs_values
## <chr> <chr> <chr> <int> <dbl>
## 1 Africa Algeria gdpPercap 1952 2449.
## 2 Africa Algeria gdpPercap 1957 3014.
## 3 Africa Algeria gdpPercap 1962 2551.
## 4 Africa Algeria gdpPercap 1967 3247.
## 5 Africa Algeria gdpPercap 1972 4183.
## 6 Africa Algeria gdpPercap 1977 4910.
The opposite of pivot_longer is pivot_wider. It spreads our observation variables back out to make a wider table. We can use this function to spread our gap_long() to the original format.
gap_original <- gap_long_sep %>%
pivot_wider(
names_from = obs_type,
values_from = obs_values)
head(gap_original)
## # A tibble: 6 × 6
## continent country year gdpPercap lifeExp pop
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Africa Algeria 1952 2449. 43.1 9279525
## 2 Africa Algeria 1957 3014. 45.7 10270856
## 3 Africa Algeria 1962 2551. 48.3 11000948
## 4 Africa Algeria 1967 3247. 51.4 12760499
## 5 Africa Algeria 1972 4183. 54.5 14760787
## 6 Africa Algeria 1977 4910. 58.0 17152804
We have provided the code below to answer Exercise 5. Ungroup the results, then subset them to select only the country, year, and gdpPercap_diff columns. Put it in a wide format so that countries are rows and years are columns.
Now, turn the data frame above back into the long format with three columns: country, year, and gdpPercap_diff.